
Informatics in Medicine Unlocked

Elsevier BV

Preprints posted in the last 7 days, ranked by how well they match Informatics in Medicine Unlocked's content profile, based on 21 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.

1
How can AI be compatible with evidence-based medicine?: with an example of analysis of lung cancer recurrence

Usuzaki, T.; Matsunbo, E.; Inamori, R.

2026-04-25 radiology and imaging 10.64898/2026.04.17.26351114 medRxiv
Top 0.1%
8.6%

Despite the remarkable progress of artificial intelligence represented by large language models, how AI technologies can contribute to the construction of evidence in evidence-based medicine (EBM) remains an overlooked issue. Now, we need an AI that can be compatible with EBM. In the present paper, we aim to propose an example analysis that may contribute to this approach using variable Vision Transformer.

2
Development of Explainable Machine Learning Framework for Early Detection and Risk Stratification of Diabetes in Age Specific Variations

Lukhele, N.; Mostafa, F.

2026-04-27 health informatics 10.64898/2026.04.25.26351733 medRxiv
Top 0.5%
1.5%

Objective: To develop and evaluate a novel machine learning (ML) framework tailored to a clinical diabetes dataset and to assess whether demographic stratification enhances model performance and interpretability for multiclass diabetes classification. Methods: A clinical dataset of 264 patient records was used to classify individuals into non-diabetic, prediabetic and diabetic categories. Several supervised learning models were trained using an 80:20 train-test split and optimized using RandomizedSearchCV and 10-fold cross-validation. Model performance was evaluated using accuracy, precision, recall and F1-score. Area under the receiver operating characteristic curve (AUC) was calculated for the best generalizing model. A structured ML framework was developed for this dataset, incorporating preprocessing, model optimization, and stratified analysis by age (<35 vs >35 years) and gender. SHAP was used for model interpretability. Results: Ensemble methods outperformed linear and single-tree approaches, with Gradient Boosting showing the most stable generalization (test accuracy 0.981, cross-validation accuracy 0.972). AUC-ROC analysis using Gradient Boosting yielded good discriminative ability across the three diabetes classes: 0.991 (non-diabetic), 0.986 (prediabetic) and 0.972 (diabetic). Stratified analysis showed improved reliability in individuals aged >35 years (accuracy = 0.94, F1-score = 0.92), while performance in younger individuals was unstable due to small sample size. SHAP analysis identified HbA1c, BMI, and age as dominant predictors. Conclusion: This study presents an ML framework integrating age-stratified modelling with explainable ML to improve interpretability. 
The findings offer clinically relevant results that can support clinical decision-making systems, individualized risk assessment, and potential applications for targeted intervention in diabetes progression.

3
Pilot Feasibility Clinical Trial of Virtual Reality for Pain Management During Repeated Pediatric Laser Procedures: Study Protocol for a Randomized Clinical Trial

Armstrong, M.; Williams, H.; Fernandez Faith, E.; Ni, A.; Xiang, H.

2026-04-22 dermatology 10.64898/2026.04.21.26351381 medRxiv
Top 1%
0.7%

Background: Lasers have wide applications in medicine and dermatology, but are associated with pain and anxiety, particularly in younger patients. Pain mitigation is often limited to topical anesthetics in the outpatient setting. Distraction techniques are limited by the need for ocular protection, which can include adhesive eye patches that completely occlude vision. Virtual reality is effective at managing procedural pain and anxiety during other short medical procedures and is a promising tool for this population. Objective: This trial aims to assess the safety, feasibility, and efficacy of Virtual Reality Pain Alleviation Therapeutic (VR-PAT) for pain management during outpatient laser procedures. Methods: 40 patients requiring outpatient laser therapy for at least two sessions will be recruited from a pediatric hospital in the midwestern United States for this randomized, two-arm crossover clinical trial with a 1:1 allocation ratio. During the first laser visit, the participant will be randomly assigned to either play the VR-PAT game during their procedure or wear the headset with a dark screen. Participants will answer questions about their pain (Numeric Rating Scale (NRS) 0-10), anxiety (State Trait Anxiety Inventory for Children, NRS 0-10, Modified Yale Preoperative Anxiety Scale (mYPAS)), and pain medication usage. Those playing the VR-PAT will additionally report simulator sickness symptoms and their experience playing the game. At their second laser visit, participants will cross over to the opposite intervention from their first visit. The primary outcomes are the differences in self-reported pain and anxiety between the two interventions. Feasibility outcomes include the proportion of screened patients who are eligible, consent, and complete both visits, and the adverse events reported. To evaluate the efficacy of pain reduction, composite scores of pain score and pain medication usage will be calculated for each laser visit. 
To evaluate the efficacy of anxiety reduction, the change of mYPAS scores will be compared between control and VR groups at each visit using Wilcoxon rank sum tests. All statistical analyses will follow the intention-to-treat principle in regard to intervention assignment at each visit. Results: The study was funded in January 2023 and began enrollment at that time. A total of n=44 participants were recruited and data collection was completed in November 2025, with n=40 subjects completing both visits. The sample was balanced with n=40 subjects using the intervention and participating in the control condition. The age range of the complete sample was 6 to 21 years at recruitment, and 55% were female. Data analysis is in progress with final results planned for June 2026. Conclusions: Findings from this innovative randomized clinical trial will provide early evidence on the efficacy of the VR-PAT for reducing self-reported pain and anxiety during outpatient laser procedures. The results from this trial will inform a large-scale, multisite study. Trial Registration: ClinicalTrials.gov NCT05645224 [https://clinicaltrials.gov/study/NCT05645224]

4
A Context-Aware Target Engagement and Pharmacodynamic Biomarker Resource to Accelerate Drug Discovery and Development

Yang, Y.; Zhao, L.; Orouji, S.; Zhu, Y.; Johnson, R. L.; Maxwell, D. S.; Mica, I.; Russell, K. P.; Al-lazikani, B.

2026-04-22 bioinformatics 10.64898/2026.04.19.719411 medRxiv
Top 1%
0.6%

Confirming target engagement in tumor experimental models remains a major challenge in oncology drug development. Pharmacodynamic biomarkers can help address this, but few systematic resources link drug targets to candidate biomarkers. We developed TargetTrace, a comprehensive resource to identify and prioritize pharmacodynamic biomarkers across nine key target classes, including transcription factors/cofactors, kinases, phosphatases, ubiquitin ligases, deubiquitinases, acetyltransferases, deacetylases, methyltransferases, and demethylases. Biomarker candidates were gathered from curated molecular interaction resources and refined using external annotations to improve accuracy. For enzyme targets with measurable substrate changes, we applied a two-agent large language model workflow, followed by manual review, to harmonize antibody information from the antibody resources and ensure that the selected biomarkers are measurable with existing laboratory tests. From more than 92,000 input interactions and over 2,300 targets, we compiled 71,323 target-biomarker relationships involving 2,270 potential drug targets, encompassing both transcription factor/cofactor-target gene and enzyme-substrate interactions. Commercial antibodies were available for over 1,400 biomarkers, supporting laboratory validation. This resource provides a structured and reusable resource for systematic identification and prioritization of pharmacodynamic biomarkers in oncology.

5
Assessing ageing, cognitive ability and freezing of gait in Parkinson's disease through integrated brain-heart network dynamics

Pitti, L.; Sitti, G.; Candia-Rivera, D.

2026-04-23 neurology 10.64898/2026.04.22.26351482 medRxiv
Top 2%
0.6%

Parkinson's Disease (PD) is a complex neurodegenerative disorder that manifests through systemic, large-scale physiological reorganizations. While research often focuses on region-specific neural changes, there is a growing need for multidomain approaches to capture the complexity of the disease and its clinical heterogeneity. This study proposes an analytical pipeline to evaluate Brain-Heart Interplay (BHI) as a novel systemic biomarker for neurodegeneration and healthy ageing. In this study we assessed BHI across three open-source datasets (EEG and ECG signals). We compared Healthy Young, Healthy Elderly, and PD patients in resting state to investigate the effects of ageing and cognitive performance. Additionally, we studied BHI trends in PD patients in the moment of freezing of gait (FOG). Methodologically, brain network organization was quantified using coherence-based EEG connectivity and graph theory, while heart activity was analyzed through Poincare plot-derived measures of cardiac autonomic activity. The coupling between these two systems was measured using the Maximal Information Coefficient to capture linear and non-linear dependencies between global cortical organization and cardiac autonomic outflow. The results demonstrate that BHI is a sensitive biomarker for detecting early multisystem dysfunction in both neurodegeneration and ageing. Furthermore, the identification of specific BHI trends during FOG onset suggests new opportunities for understanding the physiological mechanisms driving motor complications in PD. Our proposed pipeline provides a guiding tool for large-scale physiological assessment in clinical research.
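
To make the cardiac side of this pipeline concrete, the two standard Poincaré plot descriptors, SD1 (short-term variability) and SD2 (long-term variability), can be computed from a series of RR intervals. This is a minimal sketch under assumptions: the function name `poincare_sd` and the sample RR values are hypothetical, and the abstract does not say which Poincaré-derived measures the authors actually used.

```python
import math
import statistics


def poincare_sd(rr_ms):
    """Return (SD1, SD2) for a list of RR intervals in milliseconds.

    SD1 is the spread perpendicular to the identity line of the
    (RR_n, RR_n+1) scatter plot; SD2 is the spread along it.
    """
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")
    current = rr_ms[:-1]    # RR_n
    following = rr_ms[1:]   # RR_n+1
    # Rotate each point by 45 degrees: one axis is the successive
    # difference, the other is the successive sum.
    across = [(b - a) / math.sqrt(2) for a, b in zip(current, following)]
    along = [(b + a) / math.sqrt(2) for a, b in zip(current, following)]
    return statistics.stdev(across), statistics.stdev(along)


# Hypothetical RR series (ms), for illustration only.
sd1, sd2 = poincare_sd([812, 790, 805, 830, 798, 815, 801])
print(f"SD1 = {sd1:.1f} ms, SD2 = {sd2:.1f} ms")
```

A strictly alternating series gives SD2 of zero and a steadily drifting series gives SD1 of zero, which is a quick sanity check on the rotation.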

6
Accessible and Reproducible Renal Cell Carcinoma Research Through Open-Sourcing Data and Annotations

de Boer, S.; Häntze, H.; Ziegelmayer, S.; van Ginneken, B.; Prokop, M.; Bressem, K. K.; Hering, A.

2026-04-23 radiology and imaging 10.64898/2026.04.22.26351451 medRxiv
Top 2%
0.5%

Background: Medical imaging, especially computed tomography and magnetic resonance imaging, is essential in clinical care of patients with renal cell carcinoma (RCC). Artificial intelligence (AI) research into computer-aided diagnosis, staging and treatment planning needs curated and annotated datasets. Across the literature, The Cancer Genome Atlas (TCGA) datasets are widely used for model training and validation. However, re-annotation is often necessary due to limited access to public annotations, raising entry barriers and hindering comparison with prior work. Methods: We screened 1915 CT scans from three TCGA-RCC databases and employed a segmentation model to annotate kidney lesions. After a metadata-based exclusion step, we hosted a reader study with all papillary (n=56), chromophobe (n=27) and 200 randomly selected clear cell RCC cases. Two students quality-checked and corrected the data and annotated tumors and cysts. Uncertain cases were checked by a board-certified radiologist. Results: After data exclusion and quality control, a total of 142 annotated CT scans from 101 patients (26 female, 75 male, mean age 56 years) remained. This includes 95 CTs with clear cell RCC, 29 with papillary RCC and 18 with chromophobe RCC. Images and voxel-level annotations of kidneys and lesions are open sourced at https://zenodo.org/records/19630298. Conclusion: By making the annotations open-source, we encourage accessible and reproducible AI research for renal cell carcinoma. We invite other researchers who have previously annotated any of these cohorts to share their annotations.

7
DIRD+: A Browser-Based, Offline-First Clinical Platform for Diabetic Retinopathy Screening Using Edge AI Inference in Low-Resource Settings

Baier-Quezada, N.; Almendras, C.; Uribe-Hernandez, V.; Barrientos-Toledo, H.; Leiva-Fernandez, C.; Arrigo-Figueroa, M.; Brana-Pena, F.; Macilla-Leiva, A.; Lopez-Moncada, F.

2026-04-27 health informatics 10.64898/2026.04.26.26351745 medRxiv
Top 2%
0.3%

Background: Diabetic retinopathy (DR) is the leading cause of preventable blindness in working-age adults. In Chile, despite GES coverage since 2006, screening reaches only ~21% of the diabetic population under control. Chilean evidence shows that autonomous AI screening platforms have produced heterogeneous field results (sensitivity 40.8-100%, specificity 55.4%), while Ophthalmic Medical Technologists (TMOs) consistently achieve >97% sensitivity, suggesting AI is most effective as structured support for trained professionals rather than as an autonomous filter. Objective: We present DIRD+ (Diabetic Integrated Retinal Diagnosis), an open-source clinical platform that performs complete DR clinical workflows - patient management, AI-assisted lesion detection, clinical classification, annotation, and report generation - entirely within the web browser using WebAssembly-based inference, without transmitting patient data to any server. This work describes the system architecture and a preliminary technical validation. Methods: DIRD+ implements a six-stage inference pipeline using ONNX Runtime Web (v1.23) with SIMD and multi-thread optimizations, a pluggable clinical guideline engine (ICDR 2024, MINSAL Chile 2017), and a human-in-the-loop annotation workflow. A YOLOv26n detection model was trained on 500 pseudo-labeled APTOS 2019 images using the Annotix framework [11] and evaluated on the IDRiD test set (n=81 images). Results: Optic disc detection - the spatial calibration landmark - achieved AP=1.000 on IDRiD (IoU=0.1). Soft exudate detection achieved AP=0.243 (F1=0.364). Internal validation mAP50=0.578. Browser-based inference averaged 0.297 s/image (3.4 images/second) on CPU without GPU. Lesion detection performance reflects a first-generation model trained on 500 images; progressive improvement through collaborative annotation is ongoing. 
Conclusions: DIRD+ demonstrates that a complete offline-first DR clinical workflow can be deployed at zero cost within a standard web browser without server infrastructure or GPU. The pluggable guideline engine and human-in-the-loop architecture make DIRD+ a viable tool for TMO-assisted screening in connectivity-limited primary care settings.
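
Two figures in this abstract rest on simple computations: the IoU threshold (0.1) used when scoring optic disc detections, and the conversion of 0.297 s/image into images per second. The sketch below illustrates both. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the function names and the example boxes are hypothetical and are not taken from the DIRD+ code base.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, truth_box, threshold=0.1):
    """A prediction counts as a hit when IoU reaches the threshold."""
    return iou(pred_box, truth_box) >= threshold


# Hypothetical boxes: the overlap is 25 / 175, roughly 0.14.
pred, truth = (0, 0, 10, 10), (5, 5, 15, 15)
print(is_match(pred, truth, threshold=0.1))   # lenient threshold
print(is_match(pred, truth, threshold=0.5))   # stricter threshold

# Throughput implied by the reported mean latency per image.
seconds_per_image = 0.297
images_per_second = 1.0 / seconds_per_image
print(round(images_per_second, 1))
```

A threshold as low as 0.1 only requires a loose overlap, which suits a landmark used for spatial calibration but would be lenient for lesion localization.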

8
Deep Learning-Based Detection of Focal Cortical Dysplasia in Children: External Validation of the MELD Graph and 3D-nnUNet pipelines

Dell'Orco, A.; De Vita, E.; D'Arco, F.; Lange, A.; Rüber, T.; Kaindl, A. M.; Wattjes, M. P.; Thomale, U. W.; Becker, L.-L.; Tietze, A.

2026-04-22 radiology and imaging 10.64898/2026.04.21.26351368 medRxiv
Top 2%
0.3%

Focal cortical dysplasias (FCDs) are one of the most common structural causes of drug-resistant epilepsy in children but are frequently subtle and difficult to detect on conventional MRI. Many automated lesion detection methods have therefore been proposed to support neuroradiological assessment. In this study, we externally validated two recently developed deep-learning approaches for FCD detection, MELD Graph and 3D-nnUNet, in a pediatric cohort. In this retrospective single-center study, brain MRI scans of 71 children evaluated for epilepsy were analyzed, including 35 MRI-positive patients with suspected FCD and 36 MRI-negative cases based on the primary radiology reports. Both models were applied to standard 3D T1-weighted and 3D FLAIR images. Detected lesions were reviewed by an experienced pediatric neuroradiologist and classified as true positive, false positive, or false negative. Clinical semiology and EEG findings were additionally evaluated for cases with false-positive detections. At the lesion level, MELD Graph achieved a precision of 0.85 and recall of 0.52, while 3D-nnUNet achieved a precision of 0.91 and recall of 0.48. In the MRI-negative patients, MELD Graph produced more false-positive detections than 3D-nnUNet (0.53 vs. 0.14 false-positive lesions per patient). At the patient level, MELD Graph showed slightly higher sensitivity than 3D-nnUNet (0.63 vs. 0.54), whereas 3D-nnUNet demonstrated markedly higher specificity (0.86 vs. 0.56). Improved FLAIR image quality was associated with trends toward improved model performance. Both models demonstrated high precision but moderate sensitivity, indicating that they are valuable decision-support tools but cannot replace expert neuroradiological evaluation. Optimized MRI acquisition protocols are needed to further improve automated lesion detection in pediatric epilepsy.
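
The patient-level metrics reported here follow directly from a 2x2 confusion table. The sketch below shows the arithmetic. The counts are illustrative assumptions, not figures from the paper: they were chosen only because, for a 35 positive / 36 negative split, they round to the reported sensitivity and specificity values.

```python
def sensitivity(tp, fn):
    """Share of truly positive patients that the model flags."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of truly negative patients that the model leaves unflagged."""
    return tn / (tn + fp)


# Hypothetical patient-level counts (35 MRI-positive, 36 MRI-negative).
model_a = {"tp": 22, "fn": 13, "tn": 20, "fp": 16}
model_b = {"tp": 19, "fn": 16, "tn": 31, "fp": 5}

for name, c in (("model A", model_a), ("model B", model_b)):
    sens = sensitivity(c["tp"], c["fn"])
    spec = specificity(c["tn"], c["fp"])
    print(f"{name}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Read this way, a gap of three detected patients separates the two sensitivities, while the specificity gap corresponds to eleven fewer false alarms, which is the trade-off the abstract describes.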

9
Large language models and retrieval augmented generation for complex clinical codelists: evaluating performance and assessing failure modes

Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.

2026-04-24 health informatics 10.64898/2026.04.23.26351098 medRxiv
Top 2%
0.3%

Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD). Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology that we created using a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged required and optional codes. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct). Results: We saw varying accuracy across models and codelists: Gemini 3 Pro (score 43%) generally performed better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, while OpenAI GPT 5.2 performed worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). In contrast, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries and failure mode evaluations produced by parsing LLM chat logs. Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures and generation failures where retrieved codes are not used. 
Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and with codelists containing many codes. The failure modes we highlight can inform the design of future workflows that avoid them.
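
The two scoring criteria named in this abstract, omitted required codes and included irrelevant codes, can also be checked deterministically with set operations. The sketch below is that alternative, not the model-grading procedure the authors used. The function name `compare_codelist` and the codes in the example are hypothetical.

```python
def compare_codelist(produced, required, optional=frozenset()):
    """Compare a generated codelist with a reference.

    required: codes that must appear.
    optional: codes that may appear without penalty.
    """
    produced, required, optional = set(produced), set(required), set(optional)
    omitted = required - produced
    irrelevant = produced - required - optional
    recall = 1.0 - len(omitted) / len(required) if required else 1.0
    return {
        "omitted_required": sorted(omitted),
        "irrelevant": sorted(irrelevant),
        "required_recall": recall,
    }


# Hypothetical codes, for illustration only.
report = compare_codelist(
    produced={"A1", "A2", "Z9"},
    required={"A1", "A2", "A3"},
    optional={"B1"},
)
print(report)

# The evaluation grid size stated in the abstract:
# 7 codelists x 2 database subsets x 4 models x 3 epochs.
n_evaluations = 7 * 2 * 4 * 3
print(n_evaluations)
```

A set comparison like this cannot judge near-synonymous codes the way a grader model might, but it is reproducible and makes long lists such as the 214-code example easy to audit.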

10
Resolution of systemic inflammation in psoriasis following herring roe oil treatment: a post hoc analysis on inflammatory biomarkers in non-severe psoriatic patients

Ringheim-Bakka, T. A.; Gammelsaeter, R.; Tveit, K. S.

2026-04-22 dermatology 10.64898/2026.04.20.26350934 medRxiv
Top 3%
0.3%

Background: Psoriasis is a chronic immune-mediated inflammatory disease (IMID) with systemic involvement. In mild-to-moderate disease, circulating cytokines may inadequately capture systemic inflammatory burden. Composite haematological indices derived from complete blood counts, such as the systemic immune-inflammation index (SII) and systemic inflammation response index (SIRI), have emerged as sensitive prognostic markers of systemic inflammation, including in psoriasis. This exploratory post hoc analysis investigated the effects of orally administered herring roe oil (HRO), a phospholipid-rich marine oil, on systemic inflammation in patients with mild-to-moderate psoriasis utilizing these biomarkers. Methods: Data were analysed from a randomized, double-blind, placebo-controlled 26-week clinical study which investigated HRO supplementation in patients (N = 64) with mild-to-moderate psoriasis (NCT03359577). SII, SIRI, neutrophil-to-lymphocyte ratio (NLR), platelet-to-lymphocyte ratio (PLR), and monocyte-to-lymphocyte ratio (MLR) were calculated at baseline, week 12, and week 26 for patients where baseline complete blood counts (CBCs) were available (n = 60). Patients missing baseline CBCs were excluded from the analysis. Continuous changes were assessed using ANCOVA with baseline adjustment. Categorical responder analyses were performed with 25% and 30% reduction thresholds, and stratification by baseline biomarker medians was performed to evaluate treatment responses and the impact of baseline inflammation. Results: Compared with placebo, HRO treatment resulted in significant mean reductions in SII, SIRI, and PLR at week 26, with supportive trends and responder effects observed as early as week 12. Patients with elevated baseline inflammatory indices showed the greatest reductions in systemic inflammation. 
Stratification by baseline SII further revealed enhanced clinical benefit, with statistically significant PASI50 response rates in the HRO arm at week 26 among patients with lower baseline SII. Conclusion: HRO supplementation was associated with a time-dependent reduction in systemic inflammatory biomarkers in mild-to-moderate psoriasis patients. These findings support the utility of composite inflammatory indices for monitoring systemic inflammation and suggest that baseline SII may have utility in predicting treatment response and may be a useful tool for stratification in clinical trials in mild-to-moderate psoriasis patients. These results could also suggest platform potential of HRO for resolution-oriented interventions across several inflammatory conditions.
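
The composite indices in this abstract are simple ratios of complete blood count components. The sketch below uses the commonly cited definitions (SII = platelets x neutrophils / lymphocytes, SIRI = neutrophils x monocytes / lymphocytes) and adds a responder check for a relative reduction threshold. The class and function names, the blood count values, and the "at least the threshold" responder rule are assumptions for illustration; the abstract does not give the exact formulas or rules the authors applied.

```python
from dataclasses import dataclass


@dataclass
class BloodCount:
    """Cell counts in 10^9 cells/L (hypothetical example values below)."""
    neutrophils: float
    lymphocytes: float
    monocytes: float
    platelets: float


def inflammatory_indices(cbc):
    """Return NLR, PLR, MLR, SII and SIRI for one complete blood count."""
    n, l, m, p = cbc.neutrophils, cbc.lymphocytes, cbc.monocytes, cbc.platelets
    return {
        "NLR": n / l,
        "PLR": p / l,
        "MLR": m / l,
        "SII": p * n / l,
        "SIRI": n * m / l,
    }


def is_responder(baseline, follow_up, threshold=0.25):
    """True when the relative reduction from baseline reaches the threshold."""
    return (baseline - follow_up) / baseline >= threshold


baseline = inflammatory_indices(BloodCount(4.0, 2.0, 0.5, 250.0))
week_26 = inflammatory_indices(BloodCount(3.0, 2.0, 0.5, 200.0))
print(baseline["SII"], week_26["SII"])
print(is_responder(baseline["SII"], week_26["SII"], threshold=0.30))
```

Because every index divides by the lymphocyte count, a small change in lymphocytes moves all five values at once, which is worth keeping in mind when reading responder rates across indices.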

11
MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support

Van Oyen, C.; Mirza-Haq, N.

2026-04-21 health informatics 10.64898/2026.04.14.26350711 medRxiv
Top 3%
0.2%

MedSafe-Dx (v0) is a new safety-focused benchmark for evaluating large language models in clinical diagnostic decision support, built on a filtered subset of the DDx Plus dataset (N=250). MedSafe-Dx evaluates three dimensions: escalation sensitivity, avoidance of false reassurance, and calibration of uncertainty. Models were tasked with providing a ranked differential (ICD-10), an escalation decision (Urgent vs. Routine), and a confidence flag. Performance was measured via a "Safety Pass Rate," a composite metric penalizing three hard failure modes: missed escalations of life-threatening conditions, overconfident incorrect diagnoses, and unsafe reassurance in ambiguous cases. Evaluation of eleven models revealed a significant disconnect between diagnostic recall and safety. GPT-5.2 achieved the highest Safety Pass Rate (97.6%), while several models exhibited high rates of missed escalations or unsafe reassurance. MedSafe-Dx provides a robust stress test for identifying high-risk failure modes in diagnostic decision support and shows that high diagnostic accuracy does not guarantee clinical safety. While the benchmark is currently limited by synthetic data and proxy labels, it provides a reproducible, auditable framework for testing AI behavior before clinical deployment. Our findings suggest that interventions such as safety-focused prompting and reasoning-token budgets could be essential components for the safe deployment of LLMs in clinical workflows.
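
One way to read a composite "pass rate" of this kind is that a case passes only when none of the hard failure modes fires. The sketch below encodes that reading. Everything in it is an assumption made for illustration: the field names, the rule used for each failure mode, and the example cases are hypothetical, and the benchmark's actual scoring logic may differ.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    needs_escalation: bool   # reference label: life-threatening condition
    ambiguous: bool          # reference label: ambiguous presentation
    escalated: bool          # model decision: Urgent (True) or Routine (False)
    confident: bool          # model confidence flag
    diagnosis_correct: bool  # reference diagnosis found in the differential


def hard_failures(r):
    """List the hard failure modes triggered by one case (assumed rules)."""
    failures = []
    if r.needs_escalation and not r.escalated:
        failures.append("missed_escalation")
    if r.confident and not r.diagnosis_correct:
        failures.append("overconfident_wrong")
    if r.ambiguous and r.confident and not r.escalated:
        failures.append("unsafe_reassurance")
    return failures


def safety_pass_rate(results):
    """Fraction of cases with no hard failure."""
    passed = sum(1 for r in results if not hard_failures(r))
    return passed / len(results)


# Hypothetical cases, one per outcome type.
cases = [
    CaseResult(True, False, True, True, True),     # passes
    CaseResult(True, False, False, False, True),   # missed escalation
    CaseResult(False, False, False, True, False),  # overconfident and wrong
    CaseResult(False, True, False, True, True),    # unsafe reassurance
]
print(safety_pass_rate(cases))
```

Under this reading a model can list the correct diagnosis and still fail a case, which matches the abstract's point that diagnostic recall and safety can diverge.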

12
Reveal Principles of Codon Optimization via Machine Learning

Deng, F.; Li, H.; Sun, D.; Duan, G.; Sun, Z.; Xue, G.

2026-04-21 bioinformatics 10.64898/2026.04.16.718958 medRxiv
Top 3%
0.2%

A high level of protein expression is usually desirable in industry and research, and codon optimization is widely used to achieve it. Codon optimization methods fall into two branches: classical methods, which develop cost functions based on empirical rules, and AI methods, which use neural networks to learn codon choice principles from endogenous genes. Here we develop one codon optimization tool for each branch, OptimWiz 2.1 and OptimWiz 3.0, respectively. Fusion protein fluorescence measurements indicate that both OptimWiz 2.1 and OptimWiz 3.0 are superior to all the other commercially available codon optimization tools. Principles of codon optimization are revealed in the process of machine learning on both tools.

13
Comparing prognostic performance and reasoning between large language models and physicians

Gjertsen, M.; Yoon, W.; Afshar, M.; Temte, B.; Leding, B.; Halliday, S.; Bradley, K.; Kim, J.; Mitchell, J.; Sanders, A. K.; Croxford, E. L.; Caskey, J.; Churpek, M. M.; Mayampurath, A.; Gao, Y.; Miller, T.; Kruser, J. M.

2026-04-25 intensive care and critical care medicine 10.64898/2026.04.17.26350898 medRxiv
Top 3%
0.2%

Importance: Physicians routinely prognosticate to guide care delivery and shared decision making, particularly when caring for patients with critical illnesses. Yet, these physician estimates are prone to inaccuracy and uncertainty. Artificial intelligence, including large language models (LLMs), show promise in supporting or improving this prognostication. However, the performance of contemporary LLMs in prognosticating for the heterogeneous population of critically ill patients remains poorly understood. Objective: To characterize and compare the performance of LLMs and physicians when predicting 6-month mortality for hospitalized adults who survived critical illness. Design: Embedded mixed methods study with elicitation and comparison of prognostic estimates and reasoning from LLMs and practicing physicians. Setting: The publicly available, deidentified Medical Information Mart for Intensive Care (MIMIC)-IV v2.2 dataset. Participants: We randomly selected 100 hospitalizations of adult survivors of critical illness. Four contemporary LLMs (Open AI GPT-4o, o3- and o4-mini, and DeepSeek-R1) and 7 physicians provided independent prognostic estimates for each case (1,100 total estimates; 400 LLM and 700 physician). Main outcomes and measures: For each case, LLMs and physicians used the hospital discharge summary and demographics to predict 6-month mortality (yes/no) and provide their reasoning (free text). We assessed prognostic performance using accuracy, sensitivity, and specificity, and used inductive, qualitative content analysis to characterize reasonings. Results: Mean physician accuracy for predicting mortality was 70.1% (95% CI 63.7-76.4%), with sensitivity of 59.7% (95% CI 50.6-68.8%) and specificity of 80.6% (95% CI 71.7-88.2%). The top-performing LLM (OpenAI o4-mini) accuracy was 78.0% (95% CI 70.0-86.0%), with sensitivity of 80.0% (95% CI 67.4-90.2%) and specificity of 76.0% (95% CI 63.3-88.0%). 
The difference between mean physician and top-performing LLM accuracy was not statistically significant (p = 0.5). Qualitative analysis revealed similar patterns in LLM and physician expressed reasoning, except that physicians regularly and explicitly reported uncertainty while LLMs did not. Conclusion and Relevance: In this study, LLMs and physicians achieved comparable, moderate performance in predicting 6-month mortality after critical illness, with similar patterns in expressed reasoning. Our findings suggest LLMs could be used to support prognostication in clinical practice but also raise safety concerns due to the lack of LLM uncertainty expression.
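
The width of the intervals reported above is largely a consequence of the sample size of 100 cases. The sketch below computes a normal-approximation (Wald) interval for an accuracy of 78 correct out of 100. The abstract does not state how its intervals were obtained, so the method here is an assumption chosen for simplicity, and the function name `wald_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def wald_interval(successes, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


low, high = wald_interval(78, 100)
print(f"accuracy 0.78, 95% CI {low:.3f} to {high:.3f}")
```

With only 100 cases the interval spans roughly 16 percentage points, so accuracy differences of a few points between raters are hard to separate from sampling noise.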

14
Decision Curve Analysis for Evaluating Machine Learning Models for Next-Day Transfer Out of ICU

Pozo, M.; Pape, A.; Locke, B.; Pettine, W. W.

2026-04-21 health informatics 10.64898/2026.04.19.26351213 medRxiv
Top 3%
0.2%

Timely identification of intensive care unit (ICU) patients likely to exit the unit can support anticipatory workflows such as chart review, eligibility screening, and patient outreach prior to transfer. Most ICU discharge prediction studies report discrimination and calibration, but these metrics do not quantify the decision consequences of acting on predictions. Using adult ICU admissions from MIMIC-IV, we represented each ICU stay as a sequence of daily clinical summaries and trained logistic regression, random forest, and XGBoost models to predict next-day ICU transfer. Models achieved ROC AUC of 0.80-0.84 with differing calibration. We evaluated decision utility using decision curve analysis (DCA), where positive predictions trigger proactive review. Across thresholds, model-guided strategies outperformed review-all, review-none, and a simple clinical rule. To translate net benefit into implementable operations, we modeled a clinical trial recruitment workflow with an 8-hour daily time constraint, incorporating chart review and consent effort. At a feasible operating threshold (0.23), the model flagged ~23 charts/day and yielded ~1.23 enrollments/day under conservative eligibility and consent assumptions. These results demonstrate that DCA provides a transparent framework for determining when ICU transfer predictions are worth using and how thresholds should be selected to align with real-world workflow constraints. Data and Code Availability: This research has been conducted using data from MIMIC-IV. Researchers can request access via PhysioNet. Implementation code is available upon request.
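
Decision curve analysis compares strategies by net benefit at a chosen threshold probability. The standard definition is net benefit = TP/N - (FP/N) x pt/(1 - pt), where pt is the threshold. The sketch below computes it for a model and for the review-all baseline; review-none has a net benefit of zero by definition. The labels and predicted probabilities are hypothetical toy values, and this is not the authors' implementation.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of reviewing every case with predicted risk >= threshold."""
    n = len(y_true)
    flagged = [(y, p) for y, p in zip(y_true, y_prob) if p >= threshold]
    tp = sum(1 for y, _ in flagged if y == 1)
    fp = sum(1 for y, _ in flagged if y == 0)
    weight = threshold / (1 - threshold)   # cost of one false positive
    return tp / n - (fp / n) * weight


def net_benefit_review_all(y_true, threshold):
    """Net benefit of reviewing every case regardless of the prediction."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = transferred next day) and predicted risks.
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.8, 0.3, 0.1]

for pt in (0.2, 0.5):
    model_nb = net_benefit(y_true, y_prob, pt)
    all_nb = net_benefit_review_all(y_true, pt)
    print(f"threshold {pt}: model {model_nb:.4f}, review-all {all_nb:.4f}")
```

The threshold doubles as an exchange rate: at 0.23, one true positive is treated as worth about 0.3 false-positive reviews (0.23 / 0.77), which is why the choice of operating threshold has to reflect the actual review workload.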

15
Identifying SARS-CoV-2 Lineages that Share the Same Relative Effective Reproduction Numbers

Musonda, R.; Ito, K.; Omori, R.; Ito, K.

2026-04-24 infectious diseases 10.64898/2026.04.22.26351531 medRxiv
Top 3%
0.2%

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has continuously evolved since its emergence in the human population in 2019. As of 1st August 2025, more than 1,700 Omicron subvariants have been designated by the Pango nomenclature system. The Pango nomenclature system designates a new lineage based on genetic and epidemiological information of SARS-CoV-2 strains. However, there is a possibility that strains that have similar genetic backgrounds and the same phenotype are given different Pango lineage names. In this paper, we propose a new algorithm, called FindPart-w, which can identify groups of viral lineages that share the same relative effective reproduction numbers. We introduced a new lineage replacement model, called the constrained RelRe model, which constrains groups of lineages to have the same relative effective reproduction numbers. The FindPart-w algorithm searches the equality constraints that minimise the Akaike Information Criterion of constrained RelRe models. Using hypothetical observation count data created by simulation, we found that the FindPart-w algorithm can identify groups of lineages having the same relative effective reproduction number in a practical computational time. Applying FindPart-w to actual real-world data of time-stamped lineage counts from the United States, we found that the Pango lineage nomenclature system may have given different lineage names to SARS-CoV-2 strains even if they have the same relative effective reproduction number and similar genetic backgrounds. In conclusion, this study showed that viruses that had the same relative effective reproduction number were identifiable from temporal count data of viral sequences. These findings will contribute to the future development of lineage designation systems that consider both genetic backgrounds and transmissibilities of lineages.
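
The selection criterion described here is the Akaike Information Criterion, AIC = 2k - 2 ln L, where k is the number of free parameters and L the maximized likelihood. Forcing lineages to share one relative effective reproduction number lowers k, so a constrained model wins whenever its loss in log-likelihood is small enough. The sketch below shows that comparison and enumerates the candidate groupings for a handful of lineages. The log-likelihood values, parameter counts, and lineage labels are hypothetical, and the brute-force enumeration is only an illustration of the search space, not the FindPart-w algorithm.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def partitions(items):
    """Yield every way to split items into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Place `first` into each existing group in turn...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ...or give it a group of its own.
        yield [[first]] + part


# Hypothetical fits: merging lineages drops two parameters at a small
# cost in log-likelihood, so the constrained model has the lower AIC.
unconstrained = aic(log_likelihood=-100.0, n_params=5)
constrained = aic(log_likelihood=-100.5, n_params=3)
print(unconstrained, constrained)

# Number of candidate groupings for four hypothetical lineages.
groupings = list(partitions(["A", "B", "C", "D"]))
print(len(groupings))
```

The number of groupings grows very quickly with the number of lineages, which is why an exhaustive search like this one stops being practical and a dedicated search algorithm is needed.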

16
Modality Fusion of MRI and Clinical Data for Glioma Tumour Grading

Kheirbakhsh, R.; Mathur, P.; Lawlor, A.

2026-04-22 health informatics 10.64898/2026.04.20.26351308 medRxiv
Top 3%
0.2%

Multimodal machine learning leverages complementary information from diverse data sources and has shown strong promise in medical imaging, where multimodal data is critical for clinical decision making. In glioma grading, integrating MRI modalities with clinical data can improve diagnostic accuracy, yet systematic comparisons of fusion strategies remain limited. This study evaluates early, intermediate, and late fusion approaches, addressing the question: How does the inclusion of clinical data alongside MRI modalities influence grading performance? To assess modality contributions, we design adaptable fusion layers and employ interpretability techniques, including attention-based analysis. Our results show that incorporating clinical data consistently outperforms unimodal and MRI-only baselines, with intermediate fusion yielding the most reliable gains. Beyond accuracy, the framework reveals how MRI and clinical features jointly shape predictions, underscoring the importance of both fusion design and interpretability for clinical adoption.

17
Oropouche, Dengue, and Chikungunya differential diagnosis. Development and validation of predictive models with surveillance data from Espirito Santo-Brazil.

Nickel Valerio, E. C.; Coli Seidel, G. M.; Da Silva Nunes, R.; Alvarenga Americano do Brasil, P. E.

2026-04-25 infectious diseases 10.64898/2026.04.17.26350875 medRxiv
Top 4%
0.1%

An Oropouche fever (OF) outbreak has been ongoing in Brazil since 2024. Prediction models exist for dengue and chikungunya, but none helps discriminate between dengue, chikungunya, and OF. Objective: This study aims to develop and validate clinical prediction models for dengue, chikungunya, and OF. Methods: This study uses surveillance data from Espirito Santo state, Brazil, from 2023 to 2025. Epidemiological investigations and biological samples were used to classify cases as either (a) clinical-epidemiologically confirmed, (b) laboratory confirmed, or (c) discarded. The predictors were all data related to signs, symptoms, and comorbidities available in the notification forms. The analysis was performed using random forest regression models, one for each outcome, in development and validation datasets. Results: A total of 465,280 observations were analyzed: 261,691 dengue cases (56.6%), 18,676 chikungunya cases (4.0%), 12,174 OF cases (2.6%), and 179,115 discarded cases (38.6%). All three models had good discrimination and moderate to good calibration after scaling prediction. The models retained between 16 and 26 predictors each. Leukopenia and vomiting were the most discriminatory predictors for dengue; arthritis, arthralgia, and rash were the most discriminatory for chikungunya; and epidemiological features were the most relevant for OF. The dengue, chikungunya, and OF models had ROC AUC of 0.726, 0.851, and 0.896 in the validation set, respectively. Conclusion: This research identified the predictors that best discriminate between dengue, chikungunya, and OF. We developed and validated predictive models, one for each condition, with moderate to very good performance, available at https://pedrobrasil.shinyapps.io/INDWELL/. They may be used in diagnostic work-up and arbovirus surveillance.

18
Pre-procedural testing using patient-specific models is associated with high training fidelity and improved procedural efficiency in endovascular aneurysm treatment

Hofmeister, J.; Bernava, G.; Rosi, A.; Brina, O.; Reymond, P.; Muster, M.; Lovblad, K.-O.; Machi, P.

2026-04-24 radiology and imaging 10.64898/2026.04.23.26351592 medRxiv
Top 4%
0.1%

Background: Even for experienced operators, endovascular treatment of unruptured intracranial aneurysms involves intraoperative uncertainty that may lead to adjustments in strategy, prolong the procedure, and potentially cause inefficiency and device waste. This study aimed to evaluate whether pre-procedural testing (PPT) of endovascular treatment using patient-specific models was associated with increased operator confidence and perceived clinical utility, including improvements in procedural efficiency and reduced resource waste. Methods: We enrolled a cohort of patients who underwent PPT before endovascular treatment for complex unruptured intracranial aneurysms and compared their outcomes with a control group treated without PPT. The primary outcome was the Training Fidelity Score, a composite of three operator-reported Likert items defined a priori. Secondary outcomes included perceived clinical utility, intraoperative strategy changes, procedural time, radiation exposure, device waste and safety. Results: A total of 85 patients met the inclusion criteria (PPT=40; control=45). The Training Fidelity Score was high across the PPT group (median, 4.33/5). Perceived clinical utility was high and further increased significantly after the procedure. A significant reduction was observed in intraoperative strategy changes, with no changes recorded in the PPT group, compared to 6/45 in the control group (RR 0.09; p=0.027). Reductions in treatment time, radiation exposure and device waste were also noted. Conclusion: PPT using patient-specific models was associated with increased operator confidence, fewer intraoperative strategy changes, improved procedural efficiency, and reduced device waste without compromising safety. These findings support its use in pre-interventional preparation, but require prospective multicenter validation.
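The abstract reports a risk ratio even though one group had zero events (0/40 versus 6/45). It does not say how the ratio was computed. The sketch below shows one common convention, adding 0.5 to every cell of the 2x2 table when any cell is zero; that the authors used this correction is an assumption, and the function name is hypothetical.

```python
def risk_ratio(events_a: float, total_a: float,
               events_b: float, total_b: float,
               correction: float = 0.5) -> float:
    """Risk ratio of group A versus group B.

    If any cell of the 2x2 table is zero, `correction` is added to every
    cell (an assumed convention, not confirmed by the source).
    """
    cells = [events_a, total_a - events_a, events_b, total_b - events_b]
    if any(c == 0 for c in cells):
        events_a += correction
        events_b += correction
        total_a += 2 * correction  # events and non-events each gain 0.5
        total_b += 2 * correction
    return (events_a / total_a) / (events_b / total_b)


# Counts from the abstract: strategy changes in PPT (0/40) vs control (6/45).
rr = risk_ratio(0, 40, 6, 45)
print(round(rr, 2))
```

Under this assumed correction the ratio rounds to the reported 0.09; other zero-cell conventions would give somewhat different values.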

19
Multicohort development and validation of a machine learning model to predict six-month functional traumatic brain injury outcomes in a large national registry

Vattipally, V. N.; Jillala, R. R.; Kramer, P.; Elshareif, M.; Singh, S.; Jo, J.; Suarez, J. I.; Sakran, J. V.; Haut, E. R.; Huang, J.; Bettegowda, C.; Azad, T. D.

2026-04-27 intensive care and critical care medicine 10.64898/2026.04.23.26351622 medRxiv
Top 4%
0.1%

Background: Prognostication after moderate-to-severe traumatic brain injury (TBI) rarely captures long-term functional recovery, despite its importance to patients, families, and clinicians. Large trauma registries such as the Trauma Quality Improvement Program (TQIP) dataset contain detailed clinical data but lack systematic follow-up, limiting their ability to study longer-term functional outcomes. Methods: We developed and externally validated a machine learning model to predict favorable six-month functional outcome (GOS MD/GR or GOSE ≥5) using harmonized data from two randomized clinical trials: CRASH (training) and ROC-TBI (validation). Five candidate classifiers (random forest [RF], linear discriminant analysis, k-nearest neighbors, naive Bayes, and support vector machine) were trained using seven shared clinical predictors. Models were evaluated using ROC-AUC, calibration metrics, and performance at the Youden optimal threshold and a high-sensitivity secondary threshold. The final model was applied to patients with moderate-to-severe TBI in the national TQIP registry (2017-2022) to estimate population-level recovery patterns. Results: The RF model demonstrated the highest overall performance after recalibration, achieving strong discrimination (AUC internal and external, 0.887 and 0.784), good calibration, and high sensitivity (0.890) and negative predictive value (0.909). Applied to 63,289 patients from TQIP, the model estimated that 45% would achieve favorable six-month outcomes at the Youden optimal threshold and 57% at the high-sensitivity threshold, with predicted recovery aligning with established clinical correlates such as younger age, higher admission GCS, and lower rates of penetrating or brainstem injuries. Conclusion: A machine learning model trained on high-quality trial data can generate clinically plausible estimates of long-term functional recovery when applied at scale to national trauma registries that lack systematic follow-up. This approach enables imputation of functional outcomes in datasets lacking follow-up, supports benchmarking and quality improvement across trauma systems, and provides a foundation for future models incorporating physiologic time-series, imaging, and biomarker data.
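The Youden optimal threshold mentioned in this abstract is the cut-off that maximises J = sensitivity + specificity - 1. A minimal sketch follows, using made-up labels and scores rather than any data from the study; the function name and the rule "score at or above the threshold counts as positive" are illustrative assumptions.

```python
from typing import List, Tuple


def youden_threshold(y_true: List[int], scores: List[float]) -> Tuple[float, float]:
    """Return (threshold, J) maximising Youden's J = sensitivity + specificity - 1.

    A score >= threshold is treated as a positive prediction. Ties in J keep
    the lowest threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical outcomes (1 = favorable) and predicted probabilities.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
threshold, j_stat = youden_threshold(y_true, scores)
print(threshold, j_stat)
```

A high-sensitivity secondary threshold, as used in the abstract, would instead pick a lower cut-off that trades specificity for sensitivity; the criterion used for it is not stated in the source.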

20
Feature-Based Parametric Response Mapping on Thoracic Computed Tomography for Robust Disease Classification in COPD

Namvar, A.; Shan, B.; Hoff, B.; Labaki, W. W.; Murray, S.; Bell, A. J.; Galban, S.; Kazerooni, E. A.; Martinez, F. J.; Hatt, C. R.; Han, M. K.; Galban, C. J.; Ram, S.

2026-04-27 radiology and imaging 10.64898/2026.04.24.26351675 medRxiv
Top 4%
0.1%

Purpose: To develop an interpretable feature-based Deep Parametric Response Mapping (PRMD) method that combines wavelet scattering convolution networks and machine learning to spatially detect and quantify functional small airways disease (fSAD) and emphysema on paired inspiratory-expiratory CT scans, with enhanced noise robustness. Materials and Methods: In this retrospective analysis of prospectively acquired data (2007-2017), we developed and validated a deep learning-based PRM approach using paired CT scans from 8,972 tobacco-exposed COPDGene participants (≥10 pack-years; mean age 60.1 ± 8.8 years; 46.5% women), including controls with normal spirometry (n = 3,872), PRISm (n = 1,089), and GOLD 1-4 COPD (n = 4,011). Data were stratified into training, validation, and testing sets (24:6:70). PRMD extracts translation-invariant image features using a wavelet scattering network and applies a subspace learning classifier to classify voxels as emphysema or non-emphysematous air trapping (fSAD). PRMD was compared with conventional density-based PRM for voxel-wise agreement, correlation with pulmonary function, robustness to noise, and sensitivity to misregistration using Pearson correlation, Bland-Altman analysis, and paired t tests. Results: PRMD achieved 95% voxel-wise agreement with standard PRM (r = 0.98) while demonstrating significantly greater robustness under noise. PRMD showed stronger correlations with FEV1 (emphysema: r = -0.54; fSAD: r = -0.51; P < 0.0001) than standard PRM (r = -0.42 for both; P < 0.0001). Under simulated high-noise conditions, standard PRM overestimated disease by ~15%, whereas PRMD limited error to < 5% (P < 0.001). Conclusion: PRMD provides an interpretable, feature-driven and noise-resilient alternative to traditional PRM for emphysema and fSAD classification, enhancing the reliability of CT-based COPD phenotyping for multi-center studies and low-dose imaging applications.
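Bland-Altman analysis, one of the agreement methods named in this abstract, summarises paired measurements by the mean difference (bias) and the 95% limits of agreement (bias ± 1.96 standard deviations of the differences). The sketch below uses invented paired values, not study data, and the function name is hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def bland_altman(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for paired measurements a and b.

    Limits of agreement are bias +/- 1.96 * sample SD of the differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical per-participant disease percentages from two methods.
method_a = [10.0, 20.0, 30.0, 40.0]
method_b = [11.0, 19.0, 32.0, 38.0]
bias, lower, upper = bland_altman(method_a, method_b)
print(bias, lower, upper)
```

A bias near zero with narrow limits indicates that the two methods can be used interchangeably for the measured quantity; wide limits indicate poor agreement even when the correlation is high.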